[SPARK-25366][SQL]Zstd and brotli CompressionCodec are not supported for parquet files#22358
10110346 wants to merge 1 commit into apache:master
Conversation
But if the codecs are found, we support those compressions, no?
docs/sql-programming-guide.md
Outdated
I prefer none, uncompressed, snappy, gzip, lzo, brotli(need install ...), lz4, zstd(need install ...).
Installation alone may not solve it.
none, uncompressed, snappy, gzip, lzo, brotli(need install brotli-codec), lz4, zstd(since Hadoop 2.9.0)
https://jira.apache.org/jira/browse/HADOOP-13578
https://github.com/rdblue/brotli-codec
https://jira.apache.org/jira/browse/HADOOP-13126
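For reference, the option under discussion is configured as below; a minimal sketch, with the caveats taken from the links above (`snappy`/`gzip` work out of the box, the other values depend on the classpath):

```
# conf/spark-defaults.conf (illustrative)
# zstd needs ZStandardCodec from Hadoop 2.9.0+ (HADOOP-13578);
# brotli needs the external brotli-codec jar on the classpath.
spark.sql.parquet.compression.codec  snappy
```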
Is hadoop-2.9.x officially supported in Spark?
It uses reflection to acquire the Hadoop compression codec classes, which are not present in hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, or hadoop-common-3.1.0.jar.
Thanks. If the codecs are found, we support those compressions, but how do I find out whether they are found? @HyukjinKwon
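One way to answer the "how do I find it" question is to probe the classpath for the codec classes directly. A minimal sketch, assuming it is run with the same classpath as the Spark driver/executors; `CodecCheck` and `isLoadable` are hypothetical names, and the class names are the ones from the error messages in this PR:

```java
public class CodecCheck {
    /** Returns true if the named class can be loaded on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] codecs = {
            "org.apache.hadoop.io.compress.ZStandardCodec", // zstd (Hadoop 2.9.0+)
            "org.apache.hadoop.io.compress.BrotliCodec"     // brotli (external jar)
        };
        for (String name : codecs) {
            System.out.println(name + (isLoadable(name) ? ": available" : ": not found"));
        }
    }
}
```

If a codec prints "not found" here, setting `spark.sql.parquet.compression.codec` to it would fail at write time.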
That's probably something we should document, or we should improve the error message. Ideally, we should fix the error message from Parquet. Don't you think?
Yeah, the error message comes from an external jar (parquet-common-1.10.0.jar).
Test build #95785 has finished for PR 22358 at commit
If the codecs are found, then we support them. One thing we might do is document that users must provide the codec explicitly, but I am not sure how many users are confused by this.
Just FYI, related discussion: #21070 (comment)
I thought that if you remove it from here, the user would not be able to use zstd or brotli even if it is installed/enabled/available?
I agree with you, removing is not a good idea.
Thanks.
(force-pushed from 1db036a to 5c478b9)
Test build #95852 has finished for PR 22358 at commit
I am 0 on this since it is worth
docs/sql-programming-guide.md
Outdated
I would just add a few lines for brotli and zstd below and leave the original text as is.
(force-pushed from 5c478b9 to dd86d3f)
Test build #95930 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
needs install -> needs to install
I'm okay with it, but I would close this if no committer approves it for a long time.
(force-pushed from dd86d3f to 64aef6b)
Test build #95969 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
@HyukjinKwon How about adding a link? Users may not know where to download it.
`brotliCodec` -> [`brotli-codec`](https://github.com/rdblue/brotli-codec)
If the link is expected to be reasonably permanent, it's fine.
It is clearer to say "zstd requires ZStandardCodec to be installed".
(force-pushed from 64aef6b to 39eaf1d)
docs/sql-programming-guide.md
Outdated
Test build #96312 has finished for PR 22358 at commit
(force-pushed from 39eaf1d to 0e5d0bc)
Test build #96314 has finished for PR 22358 at commit
srowen left a comment:
I think a bit of documentation is OK.

What changes were proposed in this pull request?
Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 contains only the zstd codec.
So I think we should remove zstd and brotli for the time being.
Setting spark.sql.parquet.compression.codec=brotli:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
Setting spark.sql.parquet.compression.codec=zstd:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
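The stack traces above come from Parquet resolving the configured codec name to a Hadoop class reflectively and wrapping the failure. A minimal pure-Java sketch of that failure mode; `CodecLookupSketch` is a hypothetical name, and Parquet's real CodecFactory wraps the failure in its own BadConfigurationException rather than a plain RuntimeException:

```java
public class CodecLookupSketch {
    // Sketch of the reflective lookup that fails above: the codec name is
    // mapped to a Hadoop class name and loaded via Class.forName; a missing
    // class is rethrown with the familiar "Class ... was not found" message.
    static Class<?> getCodecClass(String codecClassName) {
        try {
            return Class.forName(codecClassName);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(
                "Class " + codecClassName + " was not found", e);
        }
    }

    public static void main(String[] args) {
        try {
            getCodecClass("org.apache.hadoop.io.compress.BrotliCodec");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```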
How was this patch tested?
Existing unit tests.